{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7bbbab51fb95a3cf",
   "metadata": {},
   "source": [
    "# 第35章 马尔可夫决策过程"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "668ff99f",
   "metadata": {},
   "source": [
    "## 习题35.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5664b09",
   "metadata": {},
   "source": [
    "&emsp;&emsp;假设奖励属于有限集合$R \\in \\mathcal{R}$，折扣因子$\\gamma$小于1，证明回报一定有界。\n",
    "$$\n",
    "G_t = \\sum_{k=0}^{\\infty} \\gamma^k R_{t+k+1}\n",
    "$$\n",
    "提示：利用以下关系。\n",
    "$$\n",
    "\\sum_{k=0}^{\\infty}  \\gamma^k = \\frac{1}{1 - \\gamma}, \\gamma < 1\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4fad61f",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84b11f75",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a428a277",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "559ea733",
   "metadata": {},
   "source": [
    "## 习题35.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a732f1fa",
   "metadata": {},
   "source": [
    "&emsp;&emsp;写出有限期MDP的一个轨迹上各个状态的回报的递归计算算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83e98a6c",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69692b61",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a66428db",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51729a36",
   "metadata": {},
   "source": [
    "## 习题35.3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c5b25ff",
   "metadata": {},
   "source": [
    "&emsp;&emsp;例35.1的问题中，计算策略是$\\pi_2$时的轨迹$t_1$和$t_2$的回报和概率。验证策略是$\\pi_1$时轨迹$t_3$和$t_4$不存在。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "617cda92",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7a2b2bb",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb11a4a1",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe9d775b",
   "metadata": {},
   "source": [
    "## 习题35.4 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f7ae85c",
   "metadata": {},
   "source": [
    "&emsp;&emsp;推导贝尔曼期望方程的公式。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81a05e54",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32b61b8b",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcc9fea7",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98827429",
   "metadata": {},
   "source": [
    "## 习题35.5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08e7b6fb",
   "metadata": {},
   "source": [
    "&emsp;&emsp;证明关于策略评估的定理35.1。提示：参考定理35.4的证明，利用算子$H(V) = R + \\gamma PV$的压缩映射性质。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a108be6",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3069c2a",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45843efd",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91652c83",
   "metadata": {},
   "source": [
    "## 习题35.6"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40fa1b4f",
   "metadata": {},
   "source": [
    "&emsp;&emsp;例35.1的问题中，用策略评估算法求策略$\\pi_1$和$\\pi_2$的价值函数，并做比较。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31fee4e2",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f920e79a",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "157fd96e",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc6e27e",
   "metadata": {},
   "source": [
    "## 习题35.7"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a80708b5",
   "metadata": {},
   "source": [
    "&emsp;&emsp;图35.10中策略逐渐收敛到$\\pi_{\\infty}$，用策略评估算法求策略$\\pi_{\\infty}$的价值函数。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dbe127e",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbd68046",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbf3cea8",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a5b5bc2",
   "metadata": {},
   "source": [
    "## 习题35.8"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdb40ff4",
   "metadata": {},
   "source": [
    "&emsp;&emsp;例35.1的问题中，从起点到终点的路径规划也可以用$A^*$算法求解。比较与基于MDP的价值迭代算法的区别，理解环境是确定和随机的意义。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca3fe157",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a36d57a8",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d889da26",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7309899",
   "metadata": {},
   "source": [
    "## 习题35.9"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "667e9736",
   "metadata": {},
   "source": [
    "&emsp;&emsp;例35.1的问题中，假设状态的奖励如下图所示，其他条件不变。用价值迭代算法求解各个状态的最优价值。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82fcb540",
   "metadata": {},
   "source": [
    "![](../images/35-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b307aec",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ded4899",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3bb86ab",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba504121",
   "metadata": {},
   "source": [
    "## 习题35.10"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5331f2e",
   "metadata": {},
   "source": [
    "&emsp;&emsp;使用价值迭代算法求例35.2中的机器人行动规划的最优价值函数和最优策略。状态的集合$\\mathcal{S} = \\{h, l\\}$，动作的集合$\\mathcal{A} = \\{w, s, c\\}$，其中$h$表示电量高，$l$表示电量低，$w$表示工作，$s$表示待机，$c$表示充电。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfa2ec8e",
   "metadata": {},
   "source": [
    "| 转移概率 $P(s' | s, a)$ | 奖励函数 $R(s, a, s')$ |\n",
    "| :---: | :---: |\n",
    "| $ P(h | h, w) = 0.8 $ | $ R(h, w, h) = +2 $ |\n",
    "| $ P(l | h, w) = 0.2 $ | $ R(h, w, l) = 0 $ |\n",
    "| $ P(h | h, s) = 1.0 $ | $ R(h, s, h) = 0 $ |\n",
    "| $ P(l | l, w) = 1.0 $ | $ R(l, w, l) = -2 $ |\n",
    "| $ P(l | l, s) = 1.0 $ | $ R(l, s, l) = 0 $ |\n",
    "| $ P(h | l, c) = 1.0 $ | $ R(l, c, h) = +2 $ |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1942828f",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "850b803d",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d457c67",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1d5ae37",
   "metadata": {},
   "source": [
    "## 习题35.11"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10212641",
   "metadata": {},
   "source": [
    "&emsp;&emsp;写出使用动作价值函数的策略评估、策略迭代和价值迭代算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37e1948e",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c523c766",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539f3d46",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50c242fd",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
